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Abstract 

The increasing quantity and quality of functional genomic information motivate the assessment and integration of these 
data with association data, including data originating from genome-wide association studies (GWAS). We used previously 
described GWAS signals ("hits") to train a regularized logistic model in order to predict SNP causality on the basis of a large 
multivariate functional dataset. We show how this model can be used to derive Bayes factors for integrating functional and 
association data into a combined Bayesian analysis. Functional characteristics were obtained from the Encyclopedia of DNA 
Elements (ENCODE), from published expression quantitative trait loci (eQTL), and from other sources of genome-wide 
characteristics. We trained the model using all GWAS signals combined, and also using phenotype specific signals for 
autoimmune, brain-related, cancer, and cardiovascular disorders. The non-phenotype specific and the autoimmune GWAS 
signals gave the most reliable results. We found SNPs with higher probabilities of causality from functional characteristics 
showed an enrichment of more significant p-values compared to all GWAS SNPs in three large GWAS studies of complex 
traits. We investigated the ability of our Bayesian method to improve the identification of true causal signals in a psoriasis 
GWAS dataset and found that combining functional data with association data improves the ability to prioritise novel hits. 
We used the predictions from the penalized logistic regression model to calculate Bayes factors relating to functional 
characteristics and supply these online alongside resources to integrate these data with association data. 
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Introduction 

Genome-wide association studies (GWAS), which investigate 
the association between genetic variation and phenotypic traits, 
have identified many genes associated with human diseases [1]. 
However, despite considerable advances, much of the estimated 
heritability remains unaccounted for. Purcell et al. [2] showed that 
single nucleotide polymorphisms (SNPs) from GWAS with sub- 
threshold p-values account for a considerable proportion of the 
variance in independent samples suggesting that they are enriched 
for causal SNPs or their proxies. The issues of small sample size, 
low minor allele frequency, and lack of linkage disequilibrium (LD) 
between genotyped SNPs and the un-genotyped causal SNPs 
present challenges to detecting truly causal variants among near- 
significant genetic associations. 

Emerging experimental data from various sources have 
suggested that the functional characteristics of specific genomic 
regions, such as histone modifications, DNase I hypersensitive 



sites, transcription factor binding sites, and expression quantitative 
trait loci (eQTL) among others, could offer biological explanations 
for many variants found to be associated with disease (for example: 
[3,4,5]). In September 2012, a series of publications from the 
ENcyclopedia of DNA Elements (ENCODE) Project Consortium, 
had the key message that much of the human genome, including 
non-coding and intergenic regions, overlaps with at least one 
functional element that may be active in certain cell types, under 
defined physiological conditions [6]. Furthermore, putative 
disease-causing variants show significant enrichment for multiple 
functional characteristics from the ENCODE Project [7]. For 
example, GWAS variants or variants with which they are in 
perfect LD are more frequently localized to DNase I hypersen- 
sitive sites than would be expected by chance [8] . 

Various tools are available that allow one to summarise the 
functional characteristics of variants in a given region. For 
instance, Boyle et al. developed RegulomeDB, a web-based 
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SNP Prioritization Using Functional Characteristics 
Table 1. Summary statistics for the functional characteristics in the clumped non-phenotype specific analysis. 



Frequency 
Frequency of annotation 

of annotation in GWAS p value Odds 95% Confidence 

Description in GWAS hits non-hits (Fisher's exact test) Ratio interval 



splice 


0.002 


0.002 


0.142 


1.26 


0.78-2.02 


non-synonymous 


0.022 


0.007 


2.38E-38 


3.10 


2.67-3.59 


DNase Clusters 


0.193 


0.141 


1.87E-39 


1.46 


1.38-1.54 


GTEx eQTLs 

(all 7 experiments together) 


0.020 


0.007 


1.69E-31 


2.92 


2.50-3.41 


UK brain eQTLs 


0.108 


0.081 


2.19E-18 


1.37 


1.28-1.47 


UCSC Genes 


0.422 


0.357 


7.36E-35 


1.31 


1.26-1.27 


PhyloP* 


0.217 


0.172 


6.56E-27 


1.34 


1.27-1.41 


PhastCons* 


0.243 


0.202 


3.63E-20 


1.27 


1.20-1.33 


BroadHistone- H3k4Me1 


0.637 


0.566 


2.20E-40 


1.35 


1.29-1.41 


BroadHistone- H3k4Me3 


0.509 


0.434 


1.63E-43 


1.35 


1.30-1.41 


BroadHistone- H3k27ac 


0.587 


0.503 


1.28E-53 


1.48 


1.34-1.46 


Txn Factor ChIP 

(if annotation for any TF) 


0.511 


0.456 


5.26E-24 


1.25 


1.10-1.14 


miRNA 


1.12E-4 


7.00E-5 


0.116 


1.70 


0.24-12.15 


Gencode-Txn start sites 


0.003 


0.002 


0.012 


1.64 


1 .08-2.49 



*As PhyloP and PhastCons conservation scores were left as quantitative measures, the frequencies reported for those characteristics represent the presence of a 
conservation score (ie. score>0). 
doi:1 0.1 371 /joumal.pone.00981 22.t001 



interface that provides an easily interpretable score from an 
amalgamation of many functional characteristics derived from a 
variety of sources to annotate non-coding variants [9]. Other 
programs such as HaploReg [10] and SNPnexus [11] perform 
similar functions and account for LD. Although these programs 
provide facile access to summary information about the location of 
variants, they are only able to provide a relatively arbitrary and 
crude ranking of functional significance. The ranking scale used in 
RegulomeDB is based on the number of categories into which a 
variant falls with the highest scores given to those variants that fall 
into both an eQTL and a transcription factor binding site, 
regardless of cell type or specific transcription factor. 

The central challenge in the interpretation of genetic associa- 
tions lies in the processing and meaningful integration of a hugely 
diverse range of information. Having derived a score for a region 
containing a candidate variant, it has to be integrated with 
association evidence. We proposed the use of empirically derived 
weightings within a Bayesian framework [3]. More recently 
Schork et al. suggested the use of stratified False Discovery Rate 
(sFDR) and Darnell et al. proposed multi-thresholding in a 
manner that they say is equivalent to varying the significance 
threshold at each marker depending on the prior information 
[12,13]. In order to implement these approaches one needs to 
define appropriate weights. For instance, Schork et al. [12] used 
an LD-weighted scoring algorithm, and Kindt et al. [14] recently 
published a multivariate logistic regression approach. However, 
neither of these approaches is easily scalable to the very large 
number of functional characteristics that are becoming available. 

The primary objectives of this study are to describe an 
empirically justified method to identify which functional charac- 
teristics are best correlated with GWAS hit SNPs, to provide clues 
to the etiology of such traits, and to develop and implement a 
method to incorporate functional characteristics with statistical 
information in association studies. To achieve these objectives we 



use a machine learning approach, elastic net (a regularized logistic 
regression), to predict causality of a SNP based on information 
from 439 functional characteristics. We explore models based on 
all GWAS significant SNPs and also subsets of significant SNPs 
selected on the basis of phenotype and p-value. Functional 
characteristics are considered individually or in groups. We report 
a) the accuracy of the predictions to demonstrate the utility of the 
method and to investigate the behaviour of the different models, b) 
the frequency, correlation between and coefficients of the 
functional characteristics providing insight about their functional 
relevance to disease, c) a prediction score for each SNP, and d) 
details of how to combine this score with association statistics in a 
formal Bayesian framework. 

We provide online scripts that can be employed so the method 
can be used by other researchers using additional functional 
characteristics (http://www.camh.ca/en/research/ 
research_areas/genetics_and_epigenetics/Pages/Statistical- 
Genetics.aspx). For the best models we provide the probability of 
causality (the prediction score) for each SNP, the corresponding 
Bayes factor (BF imnot ) and scripts to combine BF ann()t with GWAS 
association signals. 

Results 

Functional Enrichment in GWAS Hits 

Frequencies of functional characteristics in GWAS hits com- 
pared to non-hits were compared using Fisher's exact test. Our 
analyses indicate that GWAS hits are enriched for most functional 
characteristics compared to GWAS non-hits, except for splice sites 
and micro RNA (miRNA) targets, perhaps due to the very low 
frequency of these two classes of functional characteristics 
compared to the others (Table 1 and Table 2). 

The histone modification data from the Broad Institute had the 
highest frequencies in GWAS hits, and the lowest p-values for 
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SNP Prioritization Using Functional Characteristics 
Table 2. The mean score per SNP across all functional characteristics, classified by SNP type and functional variable type. 



Clumped Separated 

All SNPs 2.7 17.7 

Hits 3.2 24.1 

Non-hits 2.6 17.0 



doi:1 0.1 371 /journal.pone.00981 22.W02 

enrichment. Many functional characteristics, most notably 
miRNA, were very infrequent, but the general picture was that 
their frequency in GWAS hits was greater than in GWAS non- 
hits. 

We examined the correlations among the various functional 
characteristics (Figure 1 and Figure 2). The separated-variable 
analysis included measures of functional characteristics from 
different cell lines as individual factors, whereas the clumped- 
variable analysis grouped data from different cell lines for the same 
functional characteristic. The clumped analysis showed a strong 
correlation between the two conservation measures (PhyloP and 
PhastCons), as well as strong correlations among the three histone 
marks (H3k4Mel, H3k4Me3 and H3k27Ac), and to a lesser 
degree among the histone marks and transcription factor binding 
sites. The separated analysis revealed additional correlations 
among cell types investigated for the DNase I hypersensitive 



characteristics from Duke University, and to a lesser degree among 
the DNase I hypersensitive characteristics from the University of 
Washington, and between these two groups. These results 
highlight the issue of correlations among functional characteristics, 
many of which simply represent the same genomic feature, for 
example a promoter element measured by different technologies. 
One advantage of elastic net as a regularized logistic regression 
method is its ability to accommodate highly correlated variables. 

Predictive Accuracy of Functional Characteristics 

We fitted predictive models for GWAS hit status via elastic net, 
using clumped and separated functional variable sets, using high- 
confidence (p<5xl0 8 ) and low-confidence (p<10 GWAS 
hits, and using all GWAS hits ("non-phenotype specific") as well as 
hits classified according broad phenotype areas. We primarily 




Figure 1. Heat map of correlations among the clumped functional characteristics. High correlations are seen between the two 
conservation measures PhyloP and PhastCons (represented as Phylo and Phast, respectively). Correlations are also seen among the histone 
modifications, H3k4Me1, H3k4Me3 and H3k27Ac (Me1, Me3 and Ac, respectively.) Transcription factor binding sites also show a correlation with the 
histone modifications, [spl i = splice sites, Nons = nonsynonymous SNPs, DHs = DNase I hypersensitive sites, GTEx = cis-eQTL data from the GTEx 
Consortium, UK = cis-eQTL data from the UK Brain Consortium, Phylo = PhyloP conservation, Phast = PhastCons conservation, Me1 = H3K4Me1 histone 
modification, Me3 = H3K4Me3 histone modification, Ac = H3K27Ac histone modification, TF = transcription factor binding sites, RNA = micro RNA 
targets, Gene = transcription start sites from Gencode]. 
doi:10.1371/journal.pone.0098122.g001 
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Figure 2. Heat map of correlations among the separated functional characteristics. A full list of the numbered characteristics is provided in 
Table SI. The white box in the bottom left corner corresponds to high correlation among the histone modifications. The less defined white area 
spanning from 72 to 21 9 on the x axis corresponds to correlation among the transcription factor binding sites, which also show some correlation with 
the histone modifications. The white box from 220 to 31 9 on the x axis corresponds to a high correlation among the different cell types for the DNase 
I hypersensitivity characteristic from Duke University. The less refined white box from around 320 and onwards on the x axis corresponds to the 
DNase I characteristics from the University of Washington. The plot also shows some correlation among the DNase I characteristics from both groups. 
doi:1 0.1 371 /journal. pone.0098122.g002 



investigated predictive accuracy in a separate test set that was not 
involved in the fitting of the models. 

For all of our fitted models, the area under the curve (AUC) of a 
receiver-operating characteristic (ROC) curve was similar in the 
test and training sets, suggesting that the models had not been 
over-fitted (File SI, Part A). 

We found that the ROC curves for both the separated and 
clumped analyses had similar AUCs: for instance 0.58 in the test 
set for the non-phenotype specific clumped analysis and 0.59 in 
the test set for the separated analysis. 

Two analyses emerged as most predictive based on integrating 
results from ROC curves, positive predictive values, and 
histograms of the probabilities of causality (the prediction scores). 
These were the analyses based on non-phenotype specific and the 

Table 3. Areas under fitted ROC curves. 



autoimmune GWAS analyses. Best results were obtained from 
analyses using high-confidence GWAS hits. Results for clumped 
and separated functional variables were very similar (Table 3 and 
Figure 3). 

We also investigated positive predictive values (PPVs) and 
histograms of the probability of causality (prediction score). PPV 
estimates could not be obtained due to insufficient data (a limited 
number of true hits correctly identified as hits at a particular 
prediction value threshold) for the phenotype specific analyses 
since these analyses contain only a subset of all GWAS hits. As a 
result, PPVs were only plotted for the non-phenotype specific 
analyses (Figure 4). PPVs appear to be highest for the analysis 
using all GWAS hits compared to the analysis using the high- 
confidence hits when defining hits as those variants with a 







Non-phenotype specific 


Brain-related 


Cancer 


Cardiovascular 


Autoimmune 


N 


3227 (8219) 


348 (1741) 


300 (607) 


369 (716) 


570 (863) 


AUC clumped 


0.67 (0.58) 


0.61 (0.52) 


0.67 (0.60) 


0.69 (0.61) 


0.71 (0.67) 


AUC separated 


0.69 (0.59) 


0.62 (0.51) 


0.68 (0.60) 


0.66 (0.61) 


0.75 (0.71) 



Main values are for analyses of high-confidence GWAS hits. Values in parentheses are for all SNPs in the GWAS Catalogue. 
doi:1 0.1 371/joumal.pone.00981 22.t003 
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False positive rate 

Figure 3. Receiver-operating characteristic (ROC) curves for analyses of clumped functional variables and high-confidence GWAS 

hits. ROC curves were obtained from a separate test set. 

doi:10.1371/journal.pone.0098122.g003 
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Figure 4. Proportion of correctly identified hits in the test data (positive predictive values). In the non-phenotype specific analyses at 
various cut-offs for defining hits: SNPs with predictive values of greater than 0.5, 0.6, 0.7, 0.8, or 0.9. Note that results are only plotted for those 
predictive value thresholds in which there are at least 1 1 hits correctly identified. 
doi:1 0.1 371 /journal.pone.00981 22.g004 
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Figure 5. Predicted values for true GWAS hits and non-hits in the test data. Panels show results of clumped-variable analyses on high- 
confidence GWAS hits for brain-related [a], cardiovascular [b], cancer [c], autoimmune [d], and non-phenotype specific hit sets [e], and for all hits in 
the GWAS Catalogue for the non-phenotype specific hit set [fl. 
doi:1 0.1 371 /journal.pone.00981 22.g005 



prediction score of greater than 0.5, 0.6, or 0.7. There was 
insufficient data at the higher thresholds for declaring a positive hit 
for the analysis based on all GWAS hits. Yet sufficient data was 
available at the higher prediction value thresholds for the analysis 
using the subset of high-confidence hits, demonstrating a broader 
spread in prediction values for that analysis compared to the 
analysis on all GWAS hits. 

Histograms of the probability of causality in the test data 
allowed visualization of the separation (or non-separation) of true 
hits versus non-hits. We found that for the non-phenotype specific 
analysis and for the autoimmune analysis, the use of high- 
confidence GWAS hits in the training data improved the 
separation of true hits from non-hits in the test data (Figure 5). 



The results from the histograms of the predicted values showed 
a broader spread in the non-phenotype specific clumped analysis 
on high-confidence GWAS hits compared to the analysis using all 
hits. The former separated true hits from non-hits better than the 
latter, with the modes of the two distributions distinct. These 
results suggest that the weighted elastic net procedure was 
successful in producing models that performed well in identifying 
true hits as well as in identifying true non-hits. While we could not 
obtain reliable PPV estimates for the autoimmune analysis due to 
insufficient data, the separation of non-hits from hits in the 
histogram was taken as sufficient evidence that the high area under 
the ROC curve for the autoimmune clumped analysis was also due 
to positive predictive power. 
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Figure 6. Coefficients of the functional characteristics for the two best analyses. The figure shows the coefficients from the clumped 

analysis on high-confidence GWAS hits for the non-phenotype specific versus the autoimmune model. 

doi:10.1371/journal.pone.0098122.g006 



Results will only be provided for the non-phenotype specific and 
the autoimmune clumped analyses, the two models that were 
deemed to be reliable based on the predictive accuracy measures. 
For the non-phenotype specific clumped analysis, the highest 
Bayes factor for annotation (11.95) was obtained for rs 1 1 1 77, 
which is a known GWAS hit associated with osteoarthritis on 
chromosome 3. It had a predicted value of 0.93. This SNP held all 
functional characteristics except three low-frequency characteris- 
tics: splice sites, miRNA targets, and Gencode transcription start 
sites. Nine percent of the variants with the top 500 Bayes factors 
were known GWAS hits. The frequency of hits in the test set data 
was 4.1%. The mean and median of the predicted values for the 
true hits in the test set were higher than those for the true non-hits 
(for hits: mean = 0.54, standard deviation = 0.13 and medi- 
an =0.54; for non-hits: mean = 0.46, standard deviation = 0.12 
and median = 0.44). 

For the autoimmune clumped analysis, the SNP with the 
highest Bayes factor was the same as for the non-phenotype 
specific clumped analysis, rs 1 1 1 7 7 . 



Investigation of the Relative Importance of Different 
Functional Characteristics 

The importance of a particular functional characteristic in 
predicting whether or not a SNP is more probable to be a GWAS 
hit is assessed by means of the magnitude of the coefficient 
assigned to the characteristic. In both the non-phenotype specific 
and autoimmune analyses we note that the nonsynonymous SNP 
functional characteristic had one of the highest coefficients 
(Figure 6). (The coefficients for both models are provided in 
File SI, Part B. Confidence intervals cannot be easily calculated 
for coefficients from elastic net, and so to estimate standard error 
for the coefficients we performed multivariate logistic regression. 
Those results are also in File SI, Part B.) GTEx eQTLs had the 
highest coefficient in the autoimmune analysis. 

Investigating Functional Predictions in the Context of 
known GWAS 

We investigated: schizophrenia (SZ) from a meta-analysis 
GWAS involving the first sample from the Psychiatric Genomics 
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Figure 7. Quantile-quantile plots stratified by predicted values for SNPs in real GWAS. All GWAS SNPs (in grey) for a schizophrenia GWAS 
from PGC1 with a Swedish sample [a], a systolic blood pressure GWAS from ICBP [b], and a height GWAS from GIANT [c]. The non-grey lines show 
plots for SNPs binned according to their predicted value from the non-phenotype specific model. 
doi:1 0.1 371 /journal.pone.00981 22.g007 



Consortium (PGC1) combined with a Swedish sample [15], 
systolic blood pressure (SBP) from the International Consortium 
for Blood Pressure (ICBP) [16], and height from Genetic 
Investigation of Anthropomorphic Traits (GIANT) Consortium 
[17]. The studies analyzed over 35,000 cases and 47,000 controls, 
200,000 individuals and, and over 180,000 individuals, respec- 
tively. 

For each study, we stratified the quantile-quantile plots 
according to predicted value bins (Figure 7). We found that 
SNPs with higher predicted values from the non-phenotype 
specific clumped analysis tended to deviate more from the line 
corresponding to the overall GWAS, in favour of more association 



signals. Similar results were obtained for all three GWAS 
analyzed: schizophrenia, systolic blood pressure and height. The 
pattern remained when only the GWAS SNPs present in the test 
set were plotted, and also when prediction values were obtained 
from models derived from excluding the genome-wide significant 
SNPs in the training set for each GWAS respectively. (Results 
shown in File SI, Part C). 

We obtained summary data obtained from a psoriasis GWAS 
study from Strange et al. [18]. We then selected 15 SNPs that were 
subsequendy discovered in a meta-analysis [19]. Using summary 
association statistics from the Strange et al. study we derived Bayes 
factors for association (BF assoc ) and Bayes factors based on 



PLOS ONE | www.plosone.org 



8 



May 2014 | Volume 9 | Issue 5 | e98122 



SNP Prioritization Using Functional Characteristics 



association data combined with the annotation of functional 
characteristics (BF assoc *BF annot ) for each SNP. We ranked the 
SNPs according to BF asso(: , and ranked them again according to 
BF assoc *BF anmrt to determine whether annotating SNPs with their 
functional characteristics improved their rank (larger Bayes factors 
were assigned smaller ranks). BF anmrt values were derived from the 
non-phenotype specific clumped analysis using high-confidence 
GWAS hits. As negative controls, we took 12 independent sets of a 
random 1 5 SNPs (which were not in high LD with any of the 1 5 
hits and had similar p-values to the hits) and compared the 
difference in the sum of ranks based on BF ass<)<: versus 
BF assoc *BF anmrt . The procedure was repeated using BF amlot derived 
from the autoimmune clumped analysis. 

Of the 15 true psoriasis hit SNPs, 7 had better ranks based on 
BF ass0( *BF anmrt compared to association information on its own 
(BF assoc ). The difference of the sum of ranks assigned to the 15 hits 
was nearly 48,000 based on BF asso( .*BF annot compared to BF assoc , 
with the former having the lower sum (better ranks). Many of the 
hit SNPs had very large ranks based merely on the association data 
(>3000), which was also the case for ranks based on BF assoc *B- 
Fannot) DU t the trend was in the right direction with better ranks 
obtained when combing the association information with the 
annotation of functional characteristics. Of the 1 2 random sets of 
1 5 independent SNPs, the trend was in the opposite direction for 
10 of the sets (with SNPs having better ranks based on BF ass<)<: 
alone). Of the remaining 2 sets, one of them had the same number 
of the SNPs with improved ranks based on BF ass(K *BF annot 

compared to 111) as did the analysis with the actual hits (7 

out of 15), and the other random set had 8 SNPs that showed 
improvement. However, for those random SNP lists the difference 
in the sum of ranks from BF ass<)<: compared to BF asso<: *BF annot was 
less than half of the improvement of ranks seen for the 15 hits. 
Comparable results were seen when using BF assoc based on the 
autoimmune clumped analysis. The difference between the sum of 
the ranks for BF annot compared to BF asso( *BF annot was over 
49,000, with improved ranks of the hits based on the BF assoc *B- 
Fannot ranks. Of the random lists the largest difference in the sum 
of ranks from BF assoc compared to BF ass(x *BF amlot was less than a 
third of the improvement of ranks seen for the 15 hits. 

Discussion 

The release of major genome wide datasets such as ENCODE 
and NIH Roadmap projects, offers an excellent opportunity to re- 
assess the existing GWAS corpus and draw conclusions about 
which functional characteristics in the human genome are most 
likely to indicate causality in association studies. We previously 
considered Bayes factors based on a limited set of functional 
characteristics, considering each functional characteristic sepa- 
rately [3]. Here we have extended our Bayesian framework by 
developing Bayes factors for multiple functional characteristics, 
considering all functional characteristics joindy. We used a 
regularized logistic regression to fit predictive models allowing 
for large numbers of both qualitative and quantitative functional 
characteristic data. We performed our analysis under a wide 
variety of conditions, including phenotype specific analysis for 
autoimmune, brain-related, cancer, and cardiovascular disorders. 

Our results confirm previous findings of differences in functional 
enrichment in GWAS hits compared to non-hits, which provided a 
rationale for utilizing functional characteristics as predictors of 
SNP causality. We found that using high-confidence GWAS hits 
(p<5xl0 -8 ) as a classifier resulted in more predictive power. 
However, if the number of GWAS hits that are available for 
training are too low, then the predictions become imprecise. This 



was a reoccurring theme for many of the phenotype specific 
analyses. The separation between true GWAS hits and non-hits in 
the test set, in addition to the AUC, should be used to assess the 
predictive power of a model. Using those methods we found that 
the non-phenotype specific and the autoimmune analyses on 
clumped variables using high-confidence GWAS hits were most 
reliable. For instance, although the AUCs were slighdy higher for 
the separated analyses, the classification of true GWAS hits and 
non-hits was better in the clumped analysis, suggesting that the 
clumped analysis may provide more accurate predictions. The 
benefit of the separated analysis is that it allows researchers to 
identify characteristics specific to certain conditions, for example 
specific cell types, which can be useful for planning further 
investigations, but the increased number of variables and sparsity 
of the data reduces the power of this type of analysis. 

While our study has demonstrated that relevant functional 
information is indeed predictive for identifying GWAS hits, and 
that Bayes factors incorporating this functional information rank 
known GWAS hits better than Bayes factors based on association 
information alone, the improvements based on current informa- 
tion (for example, in the psoriasis GWAS we analyse) are marginal. 
However, we outline reasons below to argue that the benefit of 
adding functional information to analyses of causal variant 
discovery will increase in the future. 

A limitation to the study is the restricted amount of tissue- or 
cell-specific data, especially in light of the findings that enrichment 
of disease-specific GWAS hits can differ in certain cell types, for 
example for DNase I hypersensitive sites [8]. Incorporating 
additional functional characteristics, for example those from 
relevant tissue types, will likely improve the understanding of 
which characteristics are associated with GWAS hit SNPs, 
especially for the phenotype specific analyses. Furthermore, other 
functional characteristics, such as further histone marks and other 
epigenetic modifications, could be incorporated to improve the 
models. 

The current number of GWAS hits in the GWAS Catalogue 
makes it challenging to sub-divide hits into phenotype specific 
traits. However, preliminary results showing differences in the 
coefficients for the functional characteristics suggest that as the 
number of GWAS hits grows, a phenotype specific approach from 
which to derive Bayes factors for prioritization could be more 
biologically relevant than simply an approach that combines all 
GWAS hits together. Interestingly, although it was one of the 
largest lists, the brain-related list did not have a greater predictive 
power than expected by chance. This finding only serves to 
reinforce the widely appreciated complexity of brain-related 
disorders. Nevertheless, schizophrenia GWAS significant SNPs 
showed enrichment of SNPs with high predicted values from the 
model, as did SNPs associated with systolic blood pressure or 
height. 

Using manually curated phenotype lists as done here may not 
be the best option. Using lists that are more reproducible, such as 
those based on the Experimental Factor Ontology (EFO) 
definitions, may be more appealing. However, most of the lists 
created using the EFO definitions were relatively small, covering 
less than 10% of the total GWAS hits on the common genotyping 
arrays, and thus this method of classifying GWAS hits was deemed 
to be not feasible, but may be possible in the future as the size of 
GWAS Catalogue grows still larger. 

The coefficient for non-synonymous SNPs was the highest in the 
non-phenotype specific analysis and a close second in the 
autoimmune analysis. This result suggests that being a variant in 
a gene that causes a protein alteration is an important indicator of 
whether or not a genetic variant will be truly associated with a 
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phenotype. The result agrees with the findings that the top 
associated SNPs and also those that are nominally associated with 
a phenotype are more likely to overlap genes than non-GWAS 
SNPs [20]. Our analysis appears to underscore the primacy of 
non-synonymous variation as a leading mediator of functional 
variation in the human genome. Although this result is perhaps 
unsurprising, it lends support to many of the gene-focused, rare- 
variant strategies that have been recently employed (for example: 
[21,22,23]. However, depending on the inclusiveness of promoter 
regions in chip design, these strategies may or may not capture 
other high scoring variant types, such as eQTLs and histone 
marks, which collectively account for more GWAS hits than non- 
synonymous variants alone. These patterns highlight a possible 
need for follow-up on non-coding variation chips. GTEx eQTLs 
came up as the most important factor in the autoimmune analysis. 
Two of the experiments analyzed eQTLs from lymphoblastoid 
cells, which may explain the importance of this functional 
characteristic in the autoimmune traits. 

We have shown that our method can be used to calculate Bayes 
factors for annotation (BF annot ). These can be applied to GWAS 
data to prioritise near-significant variants for follow-up based on 
the likelihood of being causal in light of their functional 
characteristics. The method takes LD into account, and uses 
information from the March 2012 release of the 1000 Genomes 
Project to map relevant annotation information from all variants 
in high LD, including both SNPs and indels. In addition to being 
used for variant prioritization of GWAS data, the methodology 
could be applied in the future to the prioritization of variants from 
fine mapping and sequencing studies. Here, the question arises as 
to whether the models described here, which were created based 
on common variation, could be applied to rare variation. In time, 
larger databases of true causal variation, including rare variation, 
will allow our method to be applied with increasing accuracy. 

Methods 

Representative GWAS SNPs 

To represent the characteristics of a typical GWAS panel, 
markers from the Affymetrix Genome-Wide Human SNP Array 
6.0, the Illumina Human lM-Duo Genotyping BeadChip, and the 
Illumina HumanOmni 1 -Quad BeadChip were downloaded from 
the UCSC genome browser, using the table browser tool [24]. The 
union of these three arrays consisted of 1,936,864 unique SNPs 
from the 22 autosomes. Because of its unique LD and genie 
properties, the MHC region (chr6: 29624809-33160245 on build 
37) was excluded from downstream analyses. 

LD proxies or "tagging" SNPs (r 2 > = 0.8) for the GWAS panel 
SNPs were identified using VCFtools [25] based on data from the 
(N= 379) Europeans (Phase I, version 3, March 14, 2012) in the 
1000 Genomes Project [26]. 

GWAS "non-hits" were defined as all those SNPs in our union 
GWAS set which were neither a GWAS "hit" (see below), nor in 
high LD (r 2 > = 0.8) with a GWAS hit. 

GWAS Hits 

To obtain a set of SNPs (and their LD proxies) with good prior 
evidence of causality, we downloaded the Catalogue of Published 
Genome-wide Association Studies from the National Human 
Genome Research Institute (NHGRI) (http://www.genome.gov/ 
gwastudies) [1] on August 6, 2013. This catalogue contains a list of 
SNPs that have been shown to be associated with a particular trait 
in a GWAS at a suggestive p-value<10~ 5 . We removed SNPs in 
the Catalogue that were not present in the representative GWAS 



set defined above, and similarly removed SNPs on the sex 
chromosomes or in the MHC region. 

All SNPs in our GWAS hit and GWAS non-hit sets, along with 
all their LD proxies, were annotated with all the functional 
characteristics defined below. Each GWAS hit and non-hit SNP 
was then given the maximum value for each functional 
characteristic found across all its LD proxies. 

Functional Characteristics 

We acquired functional data from a variety of sources (Table 4). 
A full list is provided in Table SI. Much of the data was 
downloaded from the UCSC genome browser using the table 
browser tool [24]. Additionally, a substantial proportion of the 
data was derived from the ENcyclopedia of DNA Elements 
(ENCODE) Project Consortium, which developed and imple- 
mented a range of experimental techniques with the aim of 
identifying the functional regions of the human genome, 
particularly including non-coding regions [27]. Data from this 
project that were used included transcription factor binding sites 
(TFBSs), three histone modifications (H3K4Mel, H3K4Me3, 
H3K27Ac), and DNase I hypersensitive sites. H3K4Mel is 
associated with enhancers and DNA regions downstream of 
transcription starts, and often found near regulatory elements; 
H3K4Me3 is associated with promoters active or poised to be 
active, and often found near promoters; H3K27Ac thought to 
enhance transcription possibly by blocking repressive histone mark 
H3K27Me3, and often found near active regulatory elements. The 
technologies for identifying the functional characteristics men- 
tioned above were chromatin immunoprecipitation followed by 
sequencing (ChlP-seq). 

DNase I hypersensitive sites are regions in the genome with high 
affinity of being cleaved by the DNase I enzyme. The University of 
Washington (UW) group identified DNase I hypersensitive sites 
using Digital DNase I. This method involves DNase I digestion of 
intact nuclei, isolation of DNasel "double-hit" fragments, and 
direct sequencing of fragment ends. Peaks are regions that are 
enriched in the captured fraction of the DNA suggesting they are 
occupied by the protein of interest (any score >0). The DNase I 
hypersentitive sites from the Duke University group were 
identified using a synthesis of Formaldehyde-Assisted Isolation of 
Regulatory Elements (FAIRE) and ChlP-seq experiments. We 
used a binary variable to indicate whether a SNP was within a 
peak. 

Two types of conservation scores from 46 placental mammals 
(PhyloP and PhastCons) were incorporated. Both PhyloP and 
PhastCons scores are derived using phylogenetic hidden Markov 
models. These two measures have their own advantages. PhyloP 
scores do not take into account conservation at neighbouring sites, 
whereas PhastCons estimates the probability that each nucleotide 
belongs to a conserved element. 

Expression quantitative trait loci (eQTLs), which are variants 
that are correlated with gene expression, were included. In 
particular those that fall within 2 Mb (+/ — 1 Mb upstream and 
downstream) (cis-eQTLs) of the gene of interest were used. These 
data were derived from the NCBI-hosted GTEx Browser (http:// 
www.ncbi.nlm.nih.gov/gtex/GTEX2/gtex.cgi) [28,29,30,31] and 
the UK Brain Expression Consortium (www.braineac.org) [32]. 

Summary information concerning the location or function 
within a gene (coding-non-synonymous, coding-synonymous, 
splice site, untranslated regions, etc) was derived from dbSNP. 
Non-synonymous SNPs, were classified as those SNPs with one of 
the following characteristics: stop-gain (nonsense), missense, stop- 
lost, frameshift or inframe indel. Splice site regions were defined as 
being within five base pairs upstream and five base pairs 
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Table 4. Summary of functional characteristics. 



Functional characteristic analysed 


Description 


Number and detail of measures used in the analysis* 






Clumped 


Separated 


ENCODE data 


UW DNase 1 hypersensitive sites 


Data from digital 
DNasel methodology, 
Replication 1 samples; ("peaks") 


N/A 


122 


Duke DNase 1 hypersensitive sites 


Positions of open chromatin 
by FAIRE and ChlP-seq 
experiments; ("peaks") 


N/A 


100 


DNase Clusters (v2)** 


Stringent (FDR 1% threshold) 
for "peaks" of DNase 
1 hypersensitivity from 
uniform processing by the 
ENCODE Analysis Working 
Group of data from UW and Duke 


1 


N/A 


Txn Factor ChIP 


Transcription factor 
binding sites (TFBS) 
from ChIP Seq 
experiments; ("peaks") 


1 (presence or absence 
in any TFBS) 


148 (separated byTF, but not by cell 
type due to 
sparse data) 


Broad Histone - H3K4Me1, H3K4Me3, H3K27Ac 


All are assayed using ChlP-Seq; ("peaks") 


3 (each histone mark 
grouped by the 18 
cell types 
and/or conditions) 


54 (each histone mark separated 
by cell type 
and/or conditions) 


Conservation 


PhyloP 


Average scores can be calculated 
as the sum of scores divided 
by the number of valid data 
values in the block 
(scores range from 0.1 to 2.2910} 


1 


1 


PhastCons 


Average scores can be 

calculated as for PhyloP 

(scores range from 0.1 to 1.0 in this dataset) 


1 


1 


Expression quantitative trait loci 


eQTL- GTEx 


cis-eQTLs, p<1 x10~ 5 cut-off for variants 
within 2 Mb of the 
expressed gene. 


1 (any eQTL) 


7 (separated by dataset) 


eQTLs - UK Brain 


cis-eQTLs, FDR<1% cut-off for variants 
within 2 Mb of the 
expressed gene. 


1 


1 


Other characteristics 


UCSC Genes 


UCSC known Gene 


1 


1 


Splice sites 


Splice site region defined 
as —5 to +5 range around 
exon starts & exon ends 
of UCSC Genes 


1 


1 


Nonsynonymous SNPs 


Coding Nonsynonymous SNPs defined 
as stop-gain 
(nonsense), missense, 
stop-lost, frameshift or 
inframe indel 


1 


1 


TS miRNA sites 


Conserved mammalian microRNA regulatory 
target sites for 
conserved microRNA 
families 


1 


1 


Gencode 
Transcription 
start sites 


Based on the 
GENCODE 

Genes variable (version 17, June 2013) 


1 


1 



*All SNPs are annotated in a binary fashion indicating the presence or absence of a functional characteristic, except for the conservation scores, for which the SNPs are 
assigned a quantitative score. 

**The DNase Clusters v2 file was created by combining the UW and Duke DNase I data that have been uniformly processed and replicates merged. Stringent (FDR 1% 
thresholded) peaks of DNase I hypersensitivity from uniform processing by the ENCODE Analysis Working Group were applied. Grouping the UW and the Duke DNase I 
hypersensitive variables are not equivalent to the DNase Clusters v2 file, and thus we used the latter to represent DNase I hypersensitive sites in the clumped analysis 
due to the substantial efforts made to combine the data meaningfully. 
doi:10.1371/journal.pone.0098122.t004 
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downstream of the exon start site or the exon end site. The UCSC 
gene table was used to determine the exon start and end sites. The 
UCSC gene table is comprised of a set of gene predictions based 
on data from RefSeq, GenBank, the Consensus Coding Sequence 
(CCDS) variable, Rfam, and the Transfer RNA Genes variable. 
Additional characteristics used were 3' targets for microRNA 
(miRNA), and also transcription start sites as described by 
Gencode [33]. As miRNA targets are known to be substantially 
over-predicted, we used a conservative miRNA target dataset 
based on conserved mammalian microRNA regulatory target sites 
in the ?>' UTR regions of Refseq Genes, as predicted by the 
TargetScan algorithm (Human 5.1) [34]. 

All SNPs in our GWAS hit and GWAS non-hit sets, along with 
all their LD proxies, were annotated with all the functional 
characteristics defined above. Each GWAS hit and non-hit SNP 
was then given the maximum value for each functional 
characteristic found across all its LD proxies. 

Tests for Functional Enrichment 

Counts of GWAS hits and non-hits were categorized by 
annotation value and compared using Fisher's exact test. To 
verify that results were not unduly influenced by correlations (LD) 
among observations, we also conducted analyses in which genetic 
variants were "pruned" so that all SNPs have r 2 <0.8 with all other 
SNPs. The results of these analyses were very similar (data not 
shown). 

Heat maps were constructed using R [35] to compare 
correlations among the various functional characteristics. 

Regularized Logistic Regression via Elastic Net 

We used a regularized form of logistic regression known as 
elastic net to predict GWAS hit versus non-hit status on the basis 
of the functional characteristics we had collected. We first 
employed this method for a symposium on "Functional annotation 
of GWAS hits" that we organized for the American Society of 
Human Genetics in 2010. Elastic net is a form of machine learning 
first described by Zou and Hastie [36], and is implemented in the 
glmnet package [3 7] in R [35] . Briefly, regularization is achieved 
via the subtraction of a penalty term from the log-likelihood prior 
to maximization. The penalty term includes both a "lasso-like" LI 
component (the sum of the absolute values of all fitted coefficients) 
and a "ridge-like" L2 component (the sum of squares of all fitted 
coefficients). Two parameters, alpha and lambda, determine the 
relative importance of the L 1 versus the L2 term (alpha), and the 
overall importance of the penalty term in the maximization 
(lambda). Appropriate values for these parameters were found by 
10-fold cross-validation of the training set (see below). 

Due to the unbalanced nature of the data (many more GWAS 
non-hits than hits) we employed a weighting procedure in the 
logistic regression to balance the accuracy of prediction in both 
types of markers. We weighted all hits by (Nhits+N non _hi ts )/2Nhits 
and all non-hits by (N Mte +N non _ hiB )/2N non _ hitS! where N Uts and 
Nnon-hits denote the number of hits and non-hits, respectively, in 
the training set. This procedure has the effect of equalizing the 
importance of hits and non-hits in the logistic regression. 

We randomly selected 60% of our GWAS hits and non-hits to 
form our training set. The remaining 40% of the data (the test set) 
was used to assess the performance of the model using ROC 
curves and other measures. We repeated the machine learning 
modifying the percentage of the data used in the training and test 
sets, and all splits produced similar results (File SI, Part D). To 
diminish the possibility that the models are over-fit since the 
training of the data and tuning of the parameters were conducted 
on the same set, we created a 70%/30%, split where the 70% was 



further split into 60% and 40% for training the coefficients and 
tuning the parameters, respectively. The remaining 30% was used 
to test the model. Similar results were produced when the training 
and tuning were conducted in independent subsets. (File SI, Part 
D), and so the 60%/40% training/ test set split was pursued for the 
remaining analyses. 

The data was split into the training and test sets ten times using 
a random number generator. We found that the beta coefficients 
were consistent for all of the functional characteristics with the 
exception of those with the lowest frequencies (File SI, Part E). 

For the calculation of Bayes factors, we performed elastic net, 
using the same determined values of alpha and lambda, on the full 
GWAS hit and non-hit datasets. 

Predictive Accuracy 

We employed three methods to determine which models had 
the best predictive accuracy: ROC curves, positive predictive 
values, and histograms of the predicted values from the models. 

ROC curves show the sensitivity and specificity of a fitted 
model. Sensitivity is the probability of the model providing a true 
positive result (identifying a true GWAS hit in the test set). 
Specificity is the probability of the model providing a true negative 
result (identifying a true GWAS non-hit in the test set). An AUC of 
0.5 indicates a model of no predictive value, while an AUC of 1 
indicates perfect predictive power. The ROC curves were created 
using the ROCR package [38] in R. 

ROC curves do not reflect how well a model performs within 
each class given unbalanced data (a very large number of non-hit 
SNPs compared to hits). To capture this aspect we also 
investigated positive predictive values (PPVs), the proportion of 
SNPs with predicted probabilities of causality above a certain 
threshold (we investigated thresholds of 0.5, 0.6, 0.7, 0.8 or 0.9) 
that are true GWAS hits in the test set. Finally, we visualized class 
separation with histograms of the predicted probabilities of 
causality by class. 

Definition of Functional Variables and GWAS Hits 

A variety of functional characteristics were investigated as input 
variables. One, defined as the "clum ped" analysis, featured 
groups of functional characteristics, which were collapsed into a 
single summary variable. The "separated" analysis worked on all 
functional characteristics individually. 

We performed phenotype specific analyses in which the analyses 
oudined above were carried out using phenotype specific GWAS 
hits as classifiers. An autoimmune list, a brain-related list and a 
cardiovascular list were created using the GWAS Catalogue 
searching for terms relating to those phenotypes. Each list was 
then verified by an expert in the field. 

Additionally, the GWAS Catalogue was divided up into 
categories specified by the Experimental Factor Ontology (EFO) 
definitions; however, due to small numbers of SNPs in each 
category this mode of classification is not currently feasible for 
most of the subsets (File SI, Part F). Only the cancer list, which 
was the largest disease-relevant list, was used. 

We defined two sets of GWAS hits for downstream analysis, one 
based on a weak significance threshold of p< 10 5 and one based 
on a strong significance threshold of p<5 x 1 0 8 , as reported in the 
NHGRI GWAS Catalogue. 

Derivation of Bayes Factors 

Bayesian analysis provides the most suitable framework for 
combining functional characteristics (here referred to as "annota- 
tion data"), with evidence from an association study ("association 
data") [39]. We expand on our previous empirically-based 
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approach to the calculation of Bayes factors for annotation [3] to 
allow multiple functional characteristics to be considered simul- 
taneously. The posterior odds (O post ) of causality for a trait of 
interest at a given SNP are given by the ratio of the conditional 
probability of causality, given the annotation and the association 
data, to the conditional probability of non-causality: 

^ P(Causal\AnnotData,AssocData) 
P(NotCausal\AnnotData,AssocData) 

If we assume the annotation data and association data are 
independent once conditioned on causality, then the posterior 
odds become: 

P(Causal) P(AnnotData\Causal) 
P(No t Causal) P(AnnotDa ta \ No t Causal) 

P(AssocData\ Causal) 
P(Asso cData\Not Causal) 

These three products are, respectively, the prior odds before 
seeing any association and annotation data (O prior ), the Bayes 
factor for annotation data (BF annot ) and the Bayes factor for 
association data (BF, isso(: ). We note that this factorization implies 
that, while functional annotations are allowed to be enriched (or 
impoverished) for causal SNPs relative to non-causal SNPs, the 
enrichment pattern is assumed to be the same for rare versus 
common causal SNPs, and for low-effect size versus high effect size 
causal SNPs. We accept that this is an imperfect approximation, 
and it assumes among other things that SNPs are either causal or 
non-causal when in reality their effect size can be arbitrarily close 
to zero, but we note that the main limitation of our approach lies 
with the small number of GWAS hits available to us, and 
subdividing these still further according to allele frequency and 
effect size would be problematic. We also note that by "causal" 
what we actually mean is "causal or in high LD with a causal 
variant", as both the association data and the annotation data (as 
defined in our study) are affected by LD proxies. 

In our previous study [3], we noted that if one assumed that (1) 
all hits in the NHGRI GWAS Catalogue were truly causal; and (2) 
functional annotation enrichment patterns were the same for these 
known hits as for future undiscovered truly causal SNPs; then an 
empirically based estimate for BF annot for a single binary 
functional characteristic would simply be the ratio of its frequency 
in GWAS hit versus non-hit data. Here we note that if we start 
with the same two assumptions, and further assume that a true (but 
unknown) logistic model exists that relates a set of functional 
characteristics (which can be either binary or quantitative) to the 
probability that a SNP is truly causal, then one reasonable 
approach to estimating that logistic model would be via 
regularized logistic regression as described above. Once fitted, 
the estimated odds of causality to non-causality, obtained from the 
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GWAS hit and non-hit datasets, need only be multiplied by the 
prior odds of non-causality in these dataset (i.e. the ratio of the 
weighted sample sizes of GWAS non-hits to GWAS hits in these 
data) in order to obtain the Bayes factor for annotation. Here, we 
chose to weight hits and non-hits to appear of equal size, and thus 
our estimate for BF amlot is obtained direcdy as the estimated odds 
of causality to non-causality from the regularized logistic 
regression. 

Methods for estimating BF assoc from association data are 
reviewed by Stephens & Balding [39] . Here, we use the convenient 
approximation described by Wakefield [40] . 

Investigating the Model in the Context of known GWAS 

To investigate the relevance of the predictions in a variety of 
disorders we looked at the p-value distribution of SNPs according 
to their functional class in large GWAS datasets with a substantial 
fraction of GWAS significant findings. Quantile-quantile plots 
were constructed for each study with multiple lines corresponding 
to SNPs binned according to their predicted value. Predicted 
values were those derived from the non-phenotype specific 
clumped model in which GWAS hits were defined as those SNPs 
in the GWAS Catalogue with p-values of less than 5xl0~ 8 . We 
expected those SNPs with higher predicted values to be enriched 
with GWAS SNPs with more significant p-values, whereas those 
SNPs with lower predicted values would be enriched with less 
significant p-values compared to all SNPs in the GWAS. 

We also selected some SNPs shown to be associated in a large 
psoriasis meta-analysis which had not been identified in a previous 
GWAS study [18,19]. We then determined the effect on the rank 
of their Bayes Factors in the previous study derived either using 
association data or both association data and functional charac- 
teristics. 
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